Statistical Science 

2007, Vol. 22, No. 3, 359-362 

DOI: 10.1214/07-STS235B 

Main article DOI: 10. 1214/07-STS235 

© Institute of Mathematical Statistics, 2007 



00 

o 

O 

(N 

IX, 



c3 



> 

in 
o 

O 

oo 
O 



X 



Comment: Bayesian Checking of the 
Second Level of Hierarchical Models: 
Cross- Validated Posterior Predictive 
Checks Using Discrepancy Measures 



Michael D. Larsen and Lu Lu 



1. INTRODUCTION 

We compliment Bayarri and Castellanos (BC) on 
producing an interesting and insightful paper on 
model checking applied to the second level of hierar- 
chical models. Distributions of test statistics (func- 
tions of the observed data not involving parameters) 
for judging appropriateness of hierarchical models 
typically involve nuisance (i.e., unknown) parame- 
ters. BC (2007) focus on ways to remove the depen- 
dency on nuisance parameters so that test statis- 
tics can be used to assess models, either through p- 
values or Berger's relative predictive surprise (RPS). 
They demonstrate shortcomings in terms of very low 
power of posterior predictive checks and a posterior 
empirical Bayesian method. They also demonstrate 
better performance of their partial posterior predic- 
tive (ppp) method over a prior empirical Bayesian 
method. Methods of Dey et al. (1998), O'Hagan 
(2003) and Marshall and Spiegelhalter (2003) also 
are compared. 

Methods are contrasted in terms of whether they 
require proper prior distributions, how many mea- 
sures of surprise (one per group or one total) are pro- 
duced, and the degree to which data are used twice 
in estimation and testing. Their preferred method 
(ppp) can use improper prior distributions, which 
are referred to as objective, produces a single mea- 
sure of surprise for each test statistic, and avoids 
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double use of the data. For the models and statis- 
tics considered, in comparison to the alternatives 
presented, ppp has a more uniform null distribution 
of p-values and more power versus alternatives. 

In this discussion, we suggest that cross-validated 
posterior predictive checks using discrepancy mea- 
sures hold some promise for evaluating complex mod- 
els. We apply them to O'Hagan's data example, pro- 
vide some comments on the paper and discuss pos- 
sible future work. 

2. CROSS-VALIDATED POSTERIOR 
PREDICTIVE CHECKS USING DISCREPANCY 
MEASURES 

Suppose there are data for / groups: Xi, i = 1, . . . , I, 
where Xi = (Xij,j = l,...,ni). The unknown pa- 
rameters in the first level in group i are Of f(Xi\9i) 
independently. The parameters in the second level 
of the model are rj: ir(9\rj) = Ilf=i ^(^l 7 ?)- The prior 
distribution on rj is ir(ri). Let D(X,9,rj) be a gen- 
eralized discrepancy measure. If D(X, 9, rj) = D(X), 
then it is a test statistic. Examples are given in the 
next section for the normal-normal model consid- 
ered by BC (2007). Cross- validated posterior pre- 
dictive model checking using a discrepancy mea- 
sure is implemented as follows. Separately for each 
i = 1 1: 



, M) from the pos- 



call them 



1. Generate M values (m = 1, . 
terior distribution of rj\Xf_iy. 
where Xi_j\ represents all the data without group 
i. Generating values of r\ will be accomplished in 
many cases through iterative simulation meth- 
ods that will generate values of where 0(-i) 
is the collection of group parameters excluding 
group i: f('q\X { _ i) ) = J f (rj , 9 (_;)) d6 oc 
fx(r l )ir(9 { _ i) \r l )f(X { _ i) \9 i _i ) )d9 i _ i) . 

2. Generate values 9™ of 9i given the hyperparam- 
eters V^-i) independently from vr(0j|r/^^), m = 
1,...,M. 
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Table 1 

Posterior predictive p-values for individual groups and the whole population 



Discrepancy 




Group 1 


Group 2 


Group 3 


Group 4 


Group 5 


Whole populatior 


Overall X 2 




0.568 


0.857 


0.261 


0.747 


0.287 


0.483 


1st Level X 2 




0.547 


0.893 


0.140 


0.893 


0.202 


0.496 


2nd Level X 2 




0.512 


0.594 


0.567 


0.518 


0.403 


0.513 


Max je{i,...,n,} 


Xij 


0.476 


0.851 


0.060 


0.847 


0.143 




Max je{i,...,ni} 


\Xij — 0i\ 


0.610 


0.839 


0.113 


0.923 


0.283 






\Xa - Ml 


0.682 


0.820 


0.286 


0.897 


0.151 




Max; \Xi - n\ 














0.493 



3. Generate replicate data X™ independently from 
f(Xi\er), m = l,...,M. 

4. Compute the proportion of times out of M that 
D(Xf l ,6Y l ,rf^ i) ) is greater than D(x h 6>f\ V™i)), 
m = l,...,M. 

This proposal allows the use of objective prior dis- 
tributions, is relatively easy to implement in many 
hierarchical models, avoids double use of data in 
group i for evaluating the model for group i, and al- 
lows many test statistics and discrepancy measures 
to be used based on one set of simulations of rj and 9. 
On the negative side, this procedure may lose some 
power for some statistics compared with ppp, but 
likely much less so than regular posterior predic- 
tive checks. The use of more flexibly defined dis- 
crepancies, however, could produce relatively pow- 
erful evaluations for some aspects of some models. 
The proposal requires more computing than regular 
posterior predictive checks and faces issues of multi- 
plicity in testing. The method is applied in Section 3 
and followed by discussion in Section 4. 

3. O'HAGAN'S EXAMPLE 

O'Hagan's data [see Section 5 of BC (2007)] are 
used to study the performance of model checking 
based on regular and cross-validated posterior pre- 
dictive checks utilizing various discrepancy measures. 
The model being fit is a two-level normal-normal 
hierarchical model. Notation is the same as in BC 
(2007). 

Different discrepancy measures relate to various 
parts of the model. The overall X 2 discrepancy, de- 
fined by Y^jLi ^2 + ^2) f° r group i, measures the ad- 
equacy of two levels clS Si whole. The first and sec- 
ond level X 2 discrepancies, defined as Y^Li ^ Xz3 a $^ 

and ^'"/^ for group i, detect the inadequacy of 
the first- and second-level models, respectively. The 



three measures above also can be summed across 
groups, i = 1,. . . ,1. The maximum absolute devia- 
tion of a group average from the overall center is 
Maxj \X{ — fi\ and quantifies fit of the whole model. 
The maximum value Max je { lj n .} X^ and the min- 
imum value Min Jg { li n .} Xij in group i are sensitive 
to extremes within groups. The maximum absolute 
deviations of observations from the group mean in 
group i, Max^/i ~ $«l> relates to spread 

about the mean within group i. The maximum abso- 
lute deviation of observations from the overall mean 
in group i, MaXj g .n nA \Xij — fi\, relates to ade- 
quacy of both levels in the model. 

For the regular posterior predictive checks non- 
informative prior distributions for parameters a 2 , 
fi and t 2 were used: 7r(/z) oc 1, vr(cr 2 ) oc l/cr 2 and 
7r(r 2 ) oc 1/t (or equivalently tt(t) oc 1). Table 1 shows 
the posterior predictive p-values for individual groups 
and the whole population. As observed by BC (2007), 
suffering from the double use of data, none of the 
discrepancy measures detect any evidence of incom- 
patibility between the observed data and the null 
model for individual groups or for the population as 
a whole. 

Table 2 shows the p-values based on cross- validated 
posterior predictive checks for individual groups. The 
model fits the data from groups 1, 2 and 4 very well. 
For group 3, the p-values based on the first-level X 2 
discrepancy is 0.016, which indicates slight inade- 
quacy of the first-level model. This is not surprising 
due to the extreme observation 4.10. The impact of 
this unusual observation in group 3, given a model 
of equal spread in each group, also is detected by the 
discrepancy measure MaXjg.n \Xij — 6i\, which 
has a p- value of 0.023. Despite the concern about the 
first-level model in group 3, discrepancy measures 
focused on the second level and the model overall 
do not detect any problem. This is consistent with 
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Table 2 

Cross-validated posterior predictive p-values for individual groups 



Discrepancy 




Group 1 


Group 2 


Group 3 


Group 4 


Group 5 


Overall X 2 




0.653 


0.804 


0.520 


0.730 


0.007 


1st Level X 2 




0.168 


0.315 


0.016 


0.291 


0.000 


2nd Level X 2 




0.577 


0.656 


0.654 


0.585 


0.007 


Max 3€ { 1? . ..,„,.} 




0.641 


0.723 


0.373 


0.759 


0.005 


Max je{li ... jni} 


\Xij - 0i\ 


0.203 


0.333 


0.023 


0.411 


0.002 


Maxjg-t!,...,^} 


\Xij - n\ 


0.715 


0.819 


0.472 


0.841 


0.006 



the fact that the mean and spread in group 3 are 
not extreme compared with the other groups. 

For group 5, all discrepancies detect the inade- 
quacy of the hierarchical model. This makes sense 
since group 5 has a very extreme group mean of 
4.44, which is almost three times the other group 
means, and has at least one relatively extreme ob- 
servation of 6.32, which is almost twice the over- 
all within-group standard deviation away from the 
group mean. Note that even if p-values for group 5 
were multiplied by 5 or 6 to deal with multiplicity 
of testing, the result would still be less than 0.05 for 
all the various discrepancies. 

Now we consider improving the proposed hier- 
archical model by using more robust distributions 
for modeling the outlying group and extreme ob- 
servations. Since we have seen slight inadequacy in 
the first-level model for groups 3 and 5 and seri- 
ous inadequacy in the second-level model for group 
5, we might consider using Student-t distributions 
to accommodate the unusual observations and the 
extreme group mean parameter in the hierarchical 
model. 

To perform a robust analysis, we replace the nor- 
mal distributions by Student-t distributions with 
fixed degrees of freedom v\ = 3 and v<i = 2.2 in the 
first and second levels of the hierarchical model. The 



cross-validated posterior predictive p-values assum- 
ing Student-t distributions in both levels of model 
are shown in Table 3. The two- level robust Student- 
t model successfully accommodates the unusual ob- 
servation in group 3 and almost accommodates the 
extreme observation in group 5. But it does not fully 
address the inadequacy of the second-level model for 
fitting group 5's data. Given this result, one might 
suggest treating group 5 as being generated from a 
normal distribution with a shifted location param- 
eter or an inflated variance parameter. One could 
also consider using another model, such as one of 
BC's (2007) alternative models in their Section 3.6. 
If there were more groups with higher means, then 
fitting a mixture of normal distributions in the sec- 
ond level might be an option. 

Degrees of freedom greater than 2 are used be- 
cause such i-distributions have finite variances. A 
little bit of experimenting was done to choose the 
degrees of freedom. Larger degrees of freedom had 
less success (slightly) of fitting the data, but made 
little difference in posterior distributions of param- 
eters or in results in Table 3. If the degrees of free- 
dom are thought of as parameters, then posterior 
variance will be quite high with this few groups. 



Table 3 

Cross-validated posterior predictive p-values for individual groups assuming Student-t 
distributions for both levels in the hierarchical model 



Discrepancy 


Group 1 


Group 2 


Group 3 


Group 4 


Group 5 


Overall X 2 


0.680 


0.856 


0.493 


0.822 


0.074 


1st Level X 2 


0.211 


0.376 


0.081 


0.381 


0.060 


2nd Level X 2 


0.636 


0.676 


0.667 


0.639 


0.022 


Max Jg{1 ni }Xij 


0.581 


0.664 


0.320 


0.734 


0.070 


Max je{1> ... >n . } \Xij 


0.295 


0.450 


0.117 


0.501 


0.122 


Max je{lj . ..,„.} \Xij - n\ 


0.732 


0.877 


0.440 


0.891 


0.134 
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4. SOME COMMENTS ON THE PAPER AND 
DISCUSSION 

From the above analysis we can see that it is useful 
to employ various discrepancies to measure the over- 
all performance and the specific assumptions of the 
model. Cross- validated posterior predictive checking 
allows the use of many discrepancies focused on var- 
ious aspects of the model and avoids the double use 
of data. It is also useful for assessing individual small 
groups or areas that are inconsistent with the model. 
Extensions to multilevel models, models with covari- 
ates and generalized linear models should be possi- 
ble. See Gelman (2004) and Gelman et al. (2005) and 
references therein for other examples of model diag- 
nostics that use flexibility in defining evaluations to 
advantage. 

The framework of test statistics only for check- 
ing models is less flexible and requires more effort; 
test statistics of BC's (2007) Section 3.3 required 
some refinement of procedures in Appendix C. The 
authors should be commended on their efforts and 
explanations; their results show a definite advantage 
over the other methods in their article in these ap- 
plications. 

The authors state that they intend the model 
checks to be preliminary in order to avoid model 
elaboration and (possibly) averaging. It seems un- 
likely to us that there would not be value in using 
such methods for further study of models past an 
initial stage. Indeed, it might be the case that un- 
usual patterns might be detectable only after mod- 
els reach a certain level of complexity. We agree with 
the authors that assessing total uncertainty through 
an elaborate model selection and refinement proce- 
dure is a challenge that deserves more study. 

An issue for future work with model assessment 
is multiplicities: the use of multiple test statistics or 
discrepancy measures to evaluate a single model and 
tests concerning individual groups. Multiplicity in 
testing will affect power and distribution of p- values. 
One could recommend selecting one discrepancy to 
assess each part of a model and avoid too much 
overlap and redundancy. We agree with BC (2007) 
that in cases with many discrepancy measures and, 
in particular, many groups, simple Bonferroni cor- 
rections might decrease power too much; in such 
cases investigation of methods from statistical ge- 
netics (small n, large p) might be helpful. As a side 
note, it would not be particularly hard to simulate 
p-value distributions and power for cross-validated 



posterior predictive p-values under the scenarios of 
BC (2007) with or without adjustment for multiplic- 
ity. 

In order to implement cross-validated posterior 
predictive checking one must sample the posterior 
distribution while leaving out groups of data. When 
the number of groups or areas is large, the com- 
putation needed for reanalyzing the model without 
each group or area could be time consuming. To 
avoid refitting the model without each group, meth- 
ods such as importance weighting and importance 
resampling could be used to approximate the poste- 
rior distribution that would be obtained if the anal- 
ysis were repeated with leaving out the group. See 
Stern and Cressie (2000), Marshall and Spiegelhal- 
ter (2003) and references therein in this regard. 

Again we wish to thank authors for a stimulating 
paper that demonstrates a method that seems quite 
effective and clearly states issues involved. 
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